Anh Nguyen, Amira Bendjama, Hong Doan
Introduction and Problem Statement
The field of data science has grown remarkably in recent years, with organizations across diverse industries recognizing the value of data-driven decision making. According to an article by 365 Data Science, the US Bureau of Labor Statistics estimated that employment of data scientists will grow by 36% from 2021 to 2031. This rate is significantly higher than the average growth rate of 5%, indicating strong demand for data science talent. The surging demand presents both opportunities and challenges for job seekers, particularly recent graduates. One of the significant hurdles they face is the lack of salary transparency in the data science job market. This opacity creates uncertainty about compensation and hinders job seekers’ ability to negotiate fair salaries.
Data science salaries vary significantly across industries and locations. For instance, according to Zippia, data scientists working in the finance and technology sectors tend to earn higher salaries than those in other industries. Geographical location also plays a crucial role in determining salaries. Large cities with a higher concentration of tech companies and higher living costs, such as San Francisco and New York, offer higher salaries than smaller cities.
The discrepancies in data science salaries can also be attributed to factors such as job responsibilities, experience level, educational background, and specific skill sets. A study conducted by Burtch Works, a leading executive recruiting firm, found that data scientists with advanced degrees, such as a Ph.D., tend to command higher salaries than those with bachelor’s or master’s degrees. Similarly, professionals with expertise in specialized areas, such as machine learning or natural language processing, often earn higher salaries because these skills are in high demand.
According to a Visier survey of 1,000 US-based full-time employees, 79% of respondents want some form of pay transparency and 32% want total transparency, in which all employee salaries are publicized. However, the 2022 Pay Clarity Survey by WTW found that only 17% of companies disclose pay range information in U.S. locations where state or local laws do not require it. In states that have pay transparency laws, such as Colorado and New York, job postings have declined since the laws went into effect. Some employers comply with the new laws by widening their posted salary ranges, sometimes so far that the ranges carry little information. These statistics highlight the lack of pay transparency not only in data science but across many job markets. Job seekers often struggle to estimate salaries for data science positions because reliable information is scarce.
To address this problem, our project aims to develop a predictive
model that estimates the salary for data science jobs. By leveraging
publicly available data and employing machine learning algorithms, we
seek to provide job seekers a better understanding of salary
expectations within the data science job market and empower them to
negotiate fair and competitive compensation packages.
Data Sources and Data Preparation
#install.packages("rpart.plot")
#install.packages("ggplot2")
#install.packages("e1071")
# Install the plotly package
#install.packages("plotly")
library(ggplot2)
ds_salaries <- read.csv("ds_salaries.csv")
summary(ds_salaries)
## X work_year experience_level employment_type
## Min. : 0.0 Min. :2020 Length:607 Length:607
## 1st Qu.:151.5 1st Qu.:2021 Class :character Class :character
## Median :303.0 Median :2022 Mode :character Mode :character
## Mean :303.0 Mean :2021
## 3rd Qu.:454.5 3rd Qu.:2022
## Max. :606.0 Max. :2022
## job_title salary salary_currency salary_in_usd
## Length:607 Min. : 4000 Length:607 Min. : 2859
## Class :character 1st Qu.: 70000 Class :character 1st Qu.: 62726
## Mode :character Median : 115000 Mode :character Median :101570
## Mean : 324000 Mean :112298
## 3rd Qu.: 165000 3rd Qu.:150000
## Max. :30400000 Max. :600000
## employee_residence remote_ratio company_location company_size
## Length:607 Min. : 0.00 Length:607 Length:607
## Class :character 1st Qu.: 50.00 Class :character Class :character
## Mode :character Median :100.00 Mode :character Mode :character
## Mean : 70.92
## 3rd Qu.:100.00
## Max. :100.00
head(ds_salaries,5)
This dataset has 607 rows and 12 columns.
We focus on salaries expressed in USD, so we keep the “salary_in_usd” column and use subset() to drop the “salary” and “salary_currency” columns, along with the row-index column “X”.
ds_salaries <- subset(ds_salaries, select = -c(X , salary_currency, salary))
head(ds_salaries, 5)
# Count rows in which every column is NA (completely empty rows)
num_null_rows <- sum(rowSums(is.na(ds_salaries)) == ncol(ds_salaries))
print(num_null_rows)
## [1] 0
No row consists entirely of missing values, and the summary() output above reports no NA's in the numeric columns.
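The check above only counts rows in which every column is missing. A complementary per-column and per-row count could be sketched as follows; the small data frame `toy` is made up for illustration and is not part of the project data.

```r
# Hypothetical toy data frame (not the project data) with two missing values
toy <- data.frame(
  job_title = c("Data Scientist", NA, "Data Engineer"),
  salary_in_usd = c(100000, 85000, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)

# Number of rows in which every value is missing (the check used above)
rows_all_na <- sum(rowSums(is.na(toy)) == ncol(toy))
print(rows_all_na)
```

In this toy example two rows contain a missing value, yet the all-columns check finds none, which is why the per-column count is the stricter test.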
repeated_entries <- subset(ds_salaries, duplicated(ds_salaries))
print(repeated_entries)
## work_year experience_level employment_type job_title
## 218 2021 MI FT Data Scientist
## 257 2021 MI FT Data Engineer
## 332 2022 SE FT Data Analyst
## 333 2022 SE FT Data Analyst
## 334 2022 SE FT Data Analyst
## 354 2022 SE FT Data Scientist
## 363 2022 SE FT Data Analyst
## 364 2022 SE FT Data Analyst
## 371 2022 SE FT Data Scientist
## 375 2022 MI FT ETL Developer
## 378 2022 SE FT Data Engineer
## 386 2022 SE FT Data Engineer
## 393 2022 SE FT Data Analyst
## 394 2022 SE FT Data Analyst
## 407 2022 MI FT Data Analyst
## 439 2022 SE FT Machine Learning Engineer
## 440 2022 SE FT Machine Learning Engineer
## 444 2022 MI FT Data Engineer
## 447 2022 SE FT Data Engineer
## 448 2022 SE FT Data Engineer
## 474 2022 SE FT Data Scientist
## 528 2022 SE FT Data Analyst
## 530 2022 SE FT Data Analyst
## 537 2022 SE FT Data Analyst
## 538 2022 SE FT Data Engineer
## 546 2022 SE FT Data Engineer
## 548 2022 SE FT Data Engineer
## 552 2022 SE FT Data Scientist
## 556 2022 SE FT Data Engineer
## 567 2022 SE FT Data Analyst
## 570 2022 SE FT Data Scientist
## 572 2022 SE FT Data Scientist
## 573 2022 SE FT Data Analyst
## 575 2022 SE FT Data Scientist
## 576 2022 SE FT Data Scientist
## 577 2022 SE FT Data Scientist
## 579 2022 SE FT Data Engineer
## 588 2022 SE FT Data Scientist
## 589 2022 SE FT Data Analyst
## 593 2022 SE FT Data Scientist
## 597 2022 SE FT Data Scientist
## 598 2022 SE FT Data Analyst
## salary_in_usd employee_residence remote_ratio company_location company_size
## 218 90734 DE 50 DE L
## 257 200000 US 100 US L
## 332 90320 US 100 US M
## 333 112900 US 100 US M
## 334 90320 US 100 US M
## 354 123000 US 100 US M
## 363 130000 CA 100 CA M
## 364 61300 CA 100 CA M
## 371 123000 US 100 US M
## 375 54957 GR 0 GR M
## 378 165400 US 100 US M
## 386 132320 US 100 US M
## 393 112900 US 100 US M
## 394 90320 US 100 US M
## 407 58000 US 0 US S
## 439 189650 US 0 US M
## 440 164996 US 0 US M
## 444 78526 GB 100 GB M
## 447 209100 US 100 US L
## 448 154600 US 100 US L
## 474 140000 US 100 US M
## 528 135000 US 100 US M
## 530 90320 US 100 US M
## 537 112900 US 100 US M
## 538 155000 US 100 US M
## 546 115000 US 100 US M
## 548 130000 US 100 US M
## 552 140400 US 0 US L
## 556 160000 US 100 US M
## 567 170000 US 100 US M
## 570 140000 US 100 US M
## 572 140000 US 100 US M
## 573 100000 US 100 US M
## 575 210000 US 100 US M
## 576 140000 US 100 US M
## 577 210000 US 100 US M
## 579 100000 US 100 US M
## 588 140000 US 100 US M
## 589 99000 US 0 US M
## 593 230000 US 100 US M
## 597 210000 US 100 US M
## 598 170000 US 100 US M
There are 42 duplicate rows, each a repeat of an earlier row in the dataset.
# Remove duplicate rows
df <- ds_salaries[!duplicated(ds_salaries), ]
# check again
repeated_entries_new <- subset(df, duplicated(df))
print(repeated_entries_new)
## [1] work_year experience_level employment_type job_title
## [5] salary_in_usd employee_residence remote_ratio company_location
## [9] company_size
## <0 rows> (or 0-length row.names)
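As a small illustration of how duplicated() behaves, the sketch below uses a made-up data frame `toy` (not the project data). It shows that only the second and later copies of a row are flagged, so filtering on the flags keeps one copy of each distinct row.

```r
# Hypothetical toy data frame: rows 1, 3 and 4 are identical
toy <- data.frame(
  job_title = c("Data Analyst", "Data Engineer", "Data Analyst", "Data Analyst"),
  salary_in_usd = c(90320, 130000, 90320, 90320),
  stringsAsFactors = FALSE
)

# duplicated() flags only the second and later copies of a row
dup_flags <- duplicated(toy)
print(dup_flags)

# Keeping the unflagged rows retains one copy of each distinct row
deduped <- toy[!dup_flags, ]
print(nrow(deduped))

# unique() keeps the same number of rows in a single call
print(nrow(unique(toy)))
```

Note that identical rows in a salary survey may be distinct respondents with the same profile, so whether to drop them is a modeling choice rather than a pure cleaning step.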
We add a new column that splits salaries into three groups: Low, Medium, and High. The groups are defined by percentiles of the salary distribution: salaries below the 25th percentile are classified as “Low”, salaries between the 25th and 75th percentiles (inclusive) as “Medium”, and salaries above the 75th percentile as “High”.
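The percentile rule can be illustrated on a handful of made-up salaries before it is applied to the dataset. The vector `salaries` below is hypothetical, and the sketch assumes R's default quantile() method.

```r
# Made-up salaries in USD, for illustration only
salaries <- c(40000, 55000, 70000, 85000, 100000, 115000, 130000, 145000, 160000)

# 25th and 75th percentiles (default quantile type)
thresholds <- quantile(salaries, probs = c(0.25, 0.75))
print(thresholds)

# Below the 25th percentile is "Low", above the 75th percentile is "High",
# and everything in between (thresholds included) is "Medium"
classification <- ifelse(salaries < thresholds[1], "Low",
                         ifelse(salaries > thresholds[2], "High", "Medium"))
print(table(classification))
```

Because the cut points are the quartiles, roughly a quarter of the rows land in each of Low and High and about half in Medium, so the three classes are not equally sized.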
# adding new column
# Calculate the percentiles
percentiles <- quantile(df$salary_in_usd, probs = c(0.25, 0.75))
# Define the thresholds
low_threshold <- percentiles[1] # 25th percentile
high_threshold <- percentiles[2] # 75th percentile
# Create a new column based on percentiles
df$salary_classification <- ifelse(df$salary_in_usd < low_threshold, "Low",
ifelse(df$salary_in_usd > high_threshold, "High", "Medium"))
# Get top 10 job titles and their value counts
top10_job_title <- head(sort(table(df$job_title), decreasing = TRUE), 10)
top10_job_title_df <- data.frame(job_title = names(top10_job_title), count = as.numeric(top10_job_title))
top10_job_title_df
# Load the required packages
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Define custom color palette
custom_colors <- c("#FF6361", "#FFA600", "#FFD700", "#FF76BC", "#69D2E7", "#6A0572", "#FF34B3", "#118AB2", "#FFFF99", "#FFC1CC")
# Create bar plot
fig <- plot_ly(data = top10_job_title_df, x = ~reorder(job_title, -count), y = ~count, type = "bar",
marker = list(color = custom_colors), text = ~count) %>%
layout(title = "Top 10 Job Titles", xaxis = list(title = "Job Titles"), yaxis = list(title = "Count"),
font = list(size = 17), template = "plotly_dark")
# Adjust layout settings to avoid label overlap
fig <- fig %>% layout(
margin = list(b = 150), # Increase bottom margin to provide space for labels
xaxis = list(
tickangle = 45, # Rotate x-axis tick labels
automargin = TRUE # Automatically adjust margins to avoid overlap
)
)
# Display the plot
fig
Our dataset has 4 different experience categories:
- EN: Entry-level / Junior
- MI: Mid-level / Intermediate
- SE: Senior-level / Expert
- EX: Executive-level / Director
# Create a mapping of category abbreviations to full names
category_names_experience <- c("EN" = "Entry-level",
"MI" = "Mid-level",
"SE" = "Senior-level",
"EX" = "Executive-level")
# Get the sorted experience data
experience <- head(sort(table(df$experience_level), decreasing = TRUE))
# Replace the category names with full forms
names(experience) <- category_names_experience[names(experience)]
# Calculate the percentage for each category
percentages <- round(100 * experience / sum(experience), 2)
# Define a custom color palette
custom_colors <- c("#FFA998", "#FF76BC", "#69D2E7", "#FFA600")
# Create a pie chart, drawn clockwise starting from the top, with percentage labels
pie(experience, labels = paste(names(experience), "(", percentages, "%)"), col = custom_colors, border = "white", clockwise = TRUE, init.angle = 90)
# Add a legend that matches the slice colors
legend("topright", legend = names(experience), fill = custom_colors, border = "white", cex = 0.8)
# Add a title in plain (non-bold) font
title("Experience Distribution", font.main = 1)
### Company Size Distribution
# Create a mapping of category abbreviations to full names
category_names_company <- c("M" = "Medium",
"L" = "Large",
"S" = "Small"
)
# Get the sorted company size data
company_size <- head(sort(table(df$company_size), decreasing = TRUE))
# Replace the category names with full forms
names(company_size) <- category_names_company[names(company_size)]
# Set the maximum value for the y-axis
max_count <- max(company_size)
# Create a bar plot with adjusted y-axis limits
barplot(company_size, col = custom_colors, main = "Company Size Distribution", xlab = "Company Size", ylab = "Count", ylim = c(0, max_count + 10))
### Salaries Distribution
# Set the scipen option to a high value
options(scipen = 10)
# Create boxplot of salaries
bp <- boxplot(df$salary_in_usd / 1000,
col = "skyblue",
main = "Boxplot of Salaries",
ylab = "Salary in Thousands USD",
notch = TRUE)
### Salary Classification Distribution
# Get the sorted salary classification data
salary_classification <- sort(table(df$salary_classification), decreasing = TRUE)
salary_classification_df <- data.frame(salary_classification= names(salary_classification ), count = as.numeric(salary_classification ))
fig <- plot_ly(
data = salary_classification_df,
x = ~reorder(salary_classification, -count),
y = ~count,
type = "bar",
marker = list(color = custom_colors),
text = ~count,
width = 700,
height = 400
)
fig <- fig %>% layout(
title = "Salary Classification Distribution",
xaxis = list(title = "Salary Classification"),
yaxis = list(title = "Count"),
font = list(size = 17),
template = "ggplot2"
)
fig
# Create a contingency table of experience levels by salary classification
experience_salary <- table(df$experience_level, df$salary_classification)
# Define custom colors for each experience level
custom_colors <- c("#69D2E7", "#FFA600", "#FF6361", "#FFD700")
# Convert the table to long format so that each count keeps its own labels
# (recycling rownames() and colnames() separately would mispair the columns)
plot_data <- as.data.frame(experience_salary, stringsAsFactors = FALSE)
names(plot_data) <- c("Experience", "Salary_Classification", "Count")
# Convert Count column to numeric
plot_data$Count <- as.numeric(plot_data$Count)
# Create the bar plot
library(plotly)
fig <- plot_ly(data = plot_data, x = ~Salary_Classification, y = ~Count,
color = ~Experience, colors = custom_colors, type = "bar") %>%
layout(title = "Experience Level by Salary Classification",
xaxis = list(title = "Salary Classification"),
yaxis = list(title = "Count"),
font = list(size = 17),
template = "plotly_dark")
fig
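When a contingency table is flattened into a long data frame for plotting, each count has to stay paired with its own row and column label. A table is stored column by column, so the row labels vary fastest. The sketch below uses made-up labels, not the project data, and relies on as.data.frame() to do the pairing.

```r
# Made-up labels, for illustration only
experience <- c("EN", "EN", "MI", "SE", "SE", "SE")
salary_class <- c("Low", "Medium", "Medium", "Medium", "High", "High")

# Contingency table: rows are experience levels, columns are salary classes
counts <- table(Experience = experience, Salary_Classification = salary_class)

# as.data.frame() on a table returns one row per cell and keeps each count
# paired with its own row and column label
long_counts <- as.data.frame(counts, responseName = "Count",
                             stringsAsFactors = FALSE)
print(long_counts)

# Look up a single cell to confirm the pairing
se_high <- long_counts$Count[long_counts$Experience == "SE" &
                               long_counts$Salary_Classification == "High"]
print(se_high)
```

Checking one or two cells against the original table, as in the last step, is a cheap way to confirm that the labels and counts still line up.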
Modeling
a. Support Vector Machine (SVM)
Evaluation and Results
a. Linear Regression
b. Random Forest
c. Decision Tree
Major Challenges and Solutions
Data is not updated
Data is imbalanced
Conclusion and Future Work
References
The Data Scientist Job Outlook in 2023 | 365 Data Science
Burtch-Works-Study_DS-PAP-2019.pdf (burtchworks.com)
New Visier Report Reveals 79% of Employees Want Pay Transparency (prnewswire.com)
More NA organizations plan to disclose pay information - WTW (wtwco.com)